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Abstract 

We present a mathematical framework for constructing and analyzing parallel 
algorithms for lattice Kinetic Monte Carlo (KMC) simulations. The resulting 
algorithms have the capacity to simulate a wide range of spatio-temporal scales 
in spatially distributed, non-equilibrium physiochemical processes with complex 
chemistry and transport micro-mechanisms. The algorithms can be tailored to 
specific hierarchical parallel architectures such as multi-core processors or clus- 
ters of Graphical Processing Units (CPUs). The proposed parallel algorithms 
arc controUcd-error approximations of kinetic Monte Carlo algorithms, depart- 
ing from the predominant paradigm of creating parallel KMC algorithms with 
exactly the same master equation as the serial one. 

Our methodology relies on a spatial decomposition of the Markov operator 
underlying the KMC algorithm into a hierarchy of operators corresponding to 
the processors' structure in the parallel architecture. Based on this operator de- 
composition, we formulate Fractional Step Approximation schemes by employ- 
ing the Trotter Theorem and its random variants; these schemes, (a) determine 
the communication schedule between processors, and (b) are run independently 
on each processor through a serial KMC simulation, called a kernel, on each 
fractional step time-window. 

Furthermore, the proposed mathematical framework allows us to rigorously 
justify the numerical and statistical consistency of the proposed algorithms, 
showing the convergence of our approximating schemes to the original serial 
KMC. The approach also provides a systematic evaluation of different processor 
communicating schedules. We carry out a detailed benchmarking of the parallel 
KMC schemes using available exact solutions, for example, in Ising-type systems 
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and we demonstrate the capabilities of the method to simulate complex spatially 
distributed reactions at very large scales on GPUs. Finally, we discuss work 
load balancing between processors and propose a re-balancing scheme based on 
probabilistic mass transport methods. 

Keywords: Kinetic Monte Carlo method, Parallel Algorithms, Markov 
semigroups. Operator Splitting, Graphical Processing Unit (GPU) 



1. Introduction 

Kinetic Monte Carlo algorithms have proved to be an important tool for the 
simulation of out-of-equilibrium, spatially distributed processes. Such models 
arise in physiochemical applications ranging from materials science and catal- 
ysis, to complex biological processes. Typically the simulated models involve 
chemistry and/or transport micro- mechanisms for atoms and molecules, e.g., re- 
actions, adsorption, desorption processes and diffusion on surfaces and through 
complex media, [l9l . [^, [J. Furthermore, mathematically similar mechanisms 
and corresponding Kinetic Monte Carlo simulations arise in agent-based, evo- 
lutionary games problems in epidemiology, ecology and traffic networks, [35|. 

The simulation of stochastic lattice systems using Kinetic Monte Carlo (KMC) 
methods relies on the direct numerical simulation of the underlying Continu- 
ous Time Markov Chain (CTMC). Since such stochastic processes are set on a 
lattice (square, hexagonal, etc.) An with N sites, they have a discrete, albeit 
high-dimensional, configuration space S and necessarily have to be of jump type 
describing transitions between different configurations ct G S. Mathematically, 
CTMC are defined in terms of the transition rates c{x,uj]<7) which correspond 
to an updating micro-mechanism that describes completely the evolution of the 
stochastic process as a transition from a current configuration a of the system 
to a new configuration ct^^" by performing an update in a neighborhood of the 
site X € An- In other words the probability of a transition over an infinitesimal 
time interval 6t is P (Stst ~ cr^'" \ St ^ cr) = c(.t, w; <7)6t + o{St^). In turn, the 
transition rates define the total rate 

H<^) XI H c{x,u;;a), (1) 

which is the intensity of the exponential waiting time for a jump to be performed 
when the system is currently at the state a. Here uj G Sx, where Sx is the set 
of all possible configurations that correspond to an update at a neighborhood 
of the site x. Once this exponential "clock" signals a jump, then the system 
transitions from the state cr to a new configuration tr^''^ with probability 

Thus the full stochastic evolution is completely defined. We refer to the discus- 
sion in Section [2] for a complete mathematical description of the KMC method. 



2 



The implementation of this method is based on efficient calculation of ((T]) and 
([2]), and was first developed in Ira, known as a BKL Algorithm, for stochas- 
tic lattice Ising models, and in known as Stochastic Simulation Algorithm 
(SSA) for reaction systems. However, as it is evident from formulas ([T]) and ([2]), 
the algorithms are inherently serial as updates are done at one site a; S Ajv at 
a time, while on the other hand the calculation of ([T|) depends on information 
from the entire spatial domain An- For these reasons it seems, at first glance, 
that KMC algorithms cannot be parallelized easily. 

However, Lubachevsky, in [22j, proposed an asynchronous approach for par- 
allel KMC simulation in the context of Ising systems, in the sense that different 
processors simulate independently parts of the physical domain, while inconsis- 
tencies at the boundaries are corrected with a series of suitable rollbacks. This 
method relies on uniformization of the total rates over each processor, see also 



13[ for the use of uniformization in the parallel simulation of general CTMC. 



Thus the approach yields a null-event algorithm, [19[, which includes rejected 
moves over the entire domain of each processor. Furthermore, Lubachevsky 
proposed a modification in order to incorporate the BKL Algorithm in his par- 
allelization method, which was implemented and tested in [17[. This is a par- 
tially rejection-free (still asynchronous) algorithm, where BKL-type rejection- 
free simulations are carried out in the interior of each processor, while uniform 
rates were used at the boundary, reducing rejections over just the boundary set. 
However, in spite of the proposed improvements, these asynchronous algorithms 
may still have a high number of rejections for boundary events and rollbacks, 
which considerably reduce the parallel efficiency, [H] . Advancing processors in 
time in a synchronous manner over a fixed time-window can provide a way to 
mitigate the excessive number of boundary inconsistencies between processors 
and ensuing rejections and rollbacks in earlier methods. Such synchronous par- 
allel KMC algorithms were proposed and extensively studied in 
However, several costly global communications are required at each cycle be- 
tween all processors, whenever a boundary event occurs in any one of them, in 
order to avoid errors in the inter-processor communication and rollbacks, |28|. 

As we will discuss further in this paper, many of the challenges in paral- 
lel KMC can be addressed by abandoning the earlier perspective on creating a 
parallel KMC algorithm with the exactly same rates (and hence the generator 
and master equation) as the serial algorithm, see [ISl for a discussion on exact 
algorithms. This is a very natural idea in the numerical analysis of contin- 
uum models such as Ordinary and Partial Differential Equations (ODE/PDE). 



First, in |34| the authors propose an approximate algorithm, in order to cre- 



ate a parallelization scheme for KMC. It was recently demonstrated [28|, |J|, 
that this method is very promising: boundary inconsistencies are resolved in a 
straightforward fashion, while there is an absence of global communications in 
contrast to synchronous relaxation schemes discussed earlier. Finally, we note 
that, among the parallel algorithms tested in (28| , the approximate algorithm 
had the highest parallel efficiency. 

Here we develop a general mathematical framework for parallelizahle ap- 
proximations of the KMC algorithm. Our approach relies on first developing a 
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spatial decomposition of the Markov operator, that defines the Kinetic Monte 
Carlo algorithm, into a hierarchy of operators. The decomposition is tailored 
to the processor architecture. Based on this operator decomposition, we for- 
mulate Fractional Step Approximation schemes by employing the Trotter prod- 
uct formula. In turn these approximating schemes determine Communication 
Schedule between processors through the sequential application of the operators 
in the decomposition, and the time step employed in the particular fractional 
step scheme. Here we discuss deterministic schedules resulting from Lie- and 
Strang-type fractional step schemes, as well as random schedules derived by the 
Random Trotter Theorem, [3]. We show that the scheme in [s^ is a particu- 
lar case of a random schedule and can be mathematically analyzed within the 
proposed framework. We recall that the deterministic Trotter Theorem was first 
proved in (sij for the approximation of semigroups corresponding to operator 
sums, and it has found wide application in the numerical ODE/PDE analysis, 



121. 



e.g. 

In Section [5] we show that the Fractional Step KMC schemes allow us to 
run independently on each processor a serial KMC simulation (called a kernel ) 
on each fractional time-step window. Furthermore, processor communication is 
straightforward at the end of each fractional time-step while no global commu- 
nications or rollbacks are involved. In Section [5] we show that the hierarchical 
structure of our methodology can be easily implemented for very general phys- 
iochemical processes modeled by lattice systems, allowing users to input as the 
algorithm's KMC kernel their preferred serial algorithm. This flexibility and 
hierarchical structure are key advantages for tailoring our framework to partic- 
ular parallel architectures with complex memory and processor hierarchies, e.g., 
clusters of CPUs. 

The proposed mathematical framework allows us to rigorously prove the 
numerical and statistical consistency of the proposed algorithms, while on the 
other hand it provides a systematic evaluation of different processor communi- 
cation schedules. Indeed, in Section [3] the numerical and statistical consistency 
of the proposed algorithms is rigorously justified by the Trotter Theorem, [36[ , 
[l2j showing the convergence of our approximating schemes to the original serial 
KMC algorithm, interpreted as convergence to the underlying Markov opera- 
tor. Using the Random Trotter Theorem [l8| we show that the approximation 
schemes with a random schedule, including the one in [34| as a special case, are 
numerically consistent in the approximation limit; that is, as the time step in 
the fractional step scheme converges to zero, it converges to a continuous time 
Markov Chain that has the same master equation and generator as the original 
serial KMC. In Section |4] we show that the proposed mathematical framework 
can allow the study of controUed-error approximation properties of Fractional 
Step KMC schemes, as well as the systematic evaluation of different processor 
communicating schedules, comparing for instance the scheme in jsj] to the Lie 
scheme P^ . 

Finally, in Section [6] we discuss work- load balancing between processors and 
propose a re-balancing scheme based on probabilistic mass transport methods, 
[10|, which is particularly well-suited for the proposed fractional step KMC 
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methods. In Section[7]we present detailed benchmarking of the proposed parallel 
algorithms using analytically available exact solutions, for instance, in Ising-type 
systems and demonstrate the capabilities of the method to simulate complex 
spatially distributed molecular systems, such as CO oxidation on a catalytic 
surface. 

2. EVactional Step Kinetic Monte Carlo Algorithms 

Wc first present the mathematical background of KMC in a more abstract 
way in order to demonstrate the generality and the applicability of the proposed 
method. We consider a o?-dimensional lattice Ajv with N lattice sites. We 
restrict our discussion to lattice gas models where the order parameter or the 
spin variable takes value in a finite countable set S = {0, 1, . . . , K}. At each 
lattice site x G A^r an order parameter (a spin variable) <7{x) £ S is defined. 
The states in E correspond to occupation of the site x G A^v by different species. 
For example, if E = {0, 1} the order parameter models the classical lattice gas 
with a single species occupying the site x when cr(x) = 1 and with the site being 
vacant if (j{x) — 0. We denote {St}t>o the stochastic process with values in the 
configuration space S = E"^" . 

Our primary focus is on modeling the basic processes of adsorption, desorp- 
tion, diffusion and reactions between different species. Thus the local dynamics 
is described by a collection of the transition rates c{x,uj;a) and by an updat- 
ing mechanism such that the configuration a of the system changes into a new 
configuration tr^''^ by an update in a neighborhood of the site x G A^r. Here 
w G iSa; , where Sx is the set of all possible configurations that correspond to an 
update at a neighborhood fix of the site x. For example, if the modeled process 
is a diffusion of the classical lattice gas a particle at x, i.e., <j{x) can move to 
any nearest neighbor of x, i.e., i^x = {y ^ Ajv | |a; — y| = 1} and Sx is the set of 
all possible configurations Sx = E^== . In other words the collection of measures 
c{x, uj; a) defines the transition probability from a to a^''^ over an infinitesimal 
time interval St. More precisely, the evolution of the system is described by a 
continuous time Markov jump process with the generator C : Cb{S) — > Cb{S) 
acting on continuous bounded test functions / G Cb{S) according to 

We recall that the evolution of the expected value for an arbitrary observable 
/ G Cb{S) is given by the action of the Markov semigroup e*^ associated with 
the generator C and the process {5t}(>o 

(e*^Mo,/)=lEso[/(^t)], (4) 

where un is the initial distribution of the process, i.e. of the random variable 
5*0, [20|. Practically, the sample paths {St}t>o are constructed via KMC, that 
is through the procedure described in ([T]) and ([2]). 
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To elucidate the introduced notation we give a few examples relevant to 
the processes modeled here. We refer, for instance, to [l^, H, 0] for a complete 
discussion of the physical processes. 

Examples. 

1. Adsorption/ Desorption for single species particles. In this case spins take 
values in cr{x) e S = {0,1}, 51^ = {a;}, Sx = {0,1} and the update 
represents a spin flip at the site x, i.e., for z g A^r 



2. Diffusion for single species particles. The state space for spins is (j{x) € 
S = {0, 1}, fix ^ {y ^ Aat I |x — ?/| = 1} includes all nearest neighbors 
of the site x to which a particle can move. Thus the new configuration 
^x,uj ^{x,y) jg obtained by updating the configuration St = <J from the 
set of possible local configuration changes {0, 1}^^ using the specific rule, 
also known as spin exchange, which involves changes at two sites x and 
y 



The transition rate is then written as c{;x,LiJ\a) = c{x,y;a). The result- 
ing process {St}t>o defines dynamics with the total number of particles 
(X^xeAiv '^i^)) conserved, sometimes referred to as Kawasaki dynamics. 

3. Multicomponent reactions. Reactions that involves K species of particles 
are easily described by enlarging the spin space to S = {0, 1, . . . , K}. If 
the reactions occur only at a single site x, the local configuration space 
Sx ~ ^ and the update is indexed by fc G S with the rule 



The rates c{x,uj]a) = c{x,k;a) define probability of a transition a{x) to 
species k = 1, . . . , K or vacating a site, i.e., fc = 0, over St. 

4. Reactions involving particles with internal degrees of freedom. Typically 
a reaction involves particles with internal degrees of freedom, and in this 
case several neighboring lattice sites may be updated at the same time, 
corresponding to the degrees of freedom of the particles involved in the 
reaction. For example, in a case such as CO oxidation on a catalytic 
surface, [2l[, when only particles at a nearest-neighbor distance can react 
we set a{x) G E = {0, 1, . . . , K}, ftx = {y & I I^; — 2/| = 1} and the set 
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of local updates Sx = S . Such Sx contains all possible reactions in a 
neighborhood of x. When reactions involve only pairs of species, the rates 
can be indexed hy k, I G S. or cquivalently iSa; = S x E. Then the reaction 
rate c{x,uj;a) = c{x,y^ k,l; a) describes the probability per unit time of 
(j{x) — ?> fc at the site x and cr(y) — ^ / at y, i.e., the updating mechanism 

{cr(z) iiz^x,y, 

k iiz^x, 

I \{ z = y, 

where \x — y\ = 1. 



2.1. Hierarchical structure of the generator 

The generator of the Markov process {St}t>o given in a general form in ([3]) is 
our starting point for the development of parallel algorithms based on geometric 
partitioning of the lattice. The lattice Ajv is decomposed into non-overlapping 
cells Cm , in = 1, . . . , AI such that 

M 

An= \J Cm, C™ n C„ = , m 7^ n . (5) 



With each set Cm a larger set Cm is associated by adding sites to Cm which 
are connected with sites in Cm by interactions or the updating mechanism, see 
Figure 1(a) More precisely, we define the range of interactions L for the set 
Cm and the closure of this set 

Cm = {z £ \ \z — x\ < L ,x G Cm.} , whcrc L = max {diam fix} . 

In many models the value of L is independent of x due to translational invariance 
of the model. The boundary of Cm is then defined as dCm — Cm H Cm- This 
geometric partitioning induces a decomposition of the generator ([3]) 



^/(^) = E E cix,co;a)[f{a^n - f{a)] (6) 

M 

= E E E ^(^'^'^)[/('^''")-/(^)] (7) 

m=l x£C,n i^S<Sx 
M 

= Lm!{o) . (8) 



The generators Lm define new Markov processes {S'™}t>o on the entire lattice 

Aat. 

Remark: In many models the interactions between particles are of the two- 
body type with the nearest-neighbor range and therefore the transition rates 
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c{x, uj; a) depend on the configuration a only through a{x) and a{y) with \x — 
y\ = 1. Similarly the new configuration cr^''^ involve changes only at the sites 
in this neighborhood. Thus the generator Cm updates the lattice sites at most 



in the set Cm = {z\\x ~ z\ — 1 ,x € Cm}, see Figure 1(a) Consequently the 
processes {S™}t>o and {S"™ }t>o corresponding to Cm and Cm' are independent 
provided Cm H Cm' = 0- 

Therefore, splitting ([6]) allows us to define independent processes which yields 
an algorithm suitable for parallel implementation, in particular, in the case 
of short-range interactions when the communication overhead can be handled 
efficiently. If the lattice A^r is partitioned into subsets Cm such that the diameter 
diamCm > L, where L is the range of interactions, we can group the sets 
{Cm}m=i ill such a way that there is no interaction between sites in the sets 
Cm that belong to the same group. For the sake of simplicity we assume that 
the lattice is divided into two sub-lattices described by the index sets and 



I*^, (black vs. white in Fig. 1(a) I, hence we have 



A^.==AfuA^:= IJ C^U \J C^- (9) 

Other lattice partitionings are also possible and may be more suitable for specific 
micro-mechanisms in the KMC or the computer architecture. Choice of the 



partitioning scheme can reduce communication overhead, see for instance [34 1 



For the sake of simplicity in the presentation, here we consider the partitioning 



depicted in ^ and Fig. 1(a) although our mathematical framework applies to 
any other sublattice decomposition. Returning to 0, the sub-lattices induce a 
corresponding splitting of the generator: 

C^C^ + CO:= Y: E ^?n- (10) 

This simple observation has key consequences for simulating the process {St}t>o 
in parallel, as well as formulating different related algorithms: the processes 
{S^}t>o corresponding to the generators Cm are mutually independent for dif- 
ferent m £ , and thus can be simulated in parallel; similarly we can handle 
the processes belonging to the group indexed by . However, there is still com- 
munication between these two groups as there is non-empty overlap between the 
groups due to interactions and updates in the sets dCm, dC'm when m G and 
m! G X*^ and the cells are within the interaction range L. To handle this com- 
munication we next introduce a Fractional Step approximation of the Markov 
semigroup e*^ associated with the process {S't}t>o- 

2.2. Fractional Step Kinetic Monte Carlo Algorithms 

The deterministic Trotter Theorem was first proved in (sij for the approx- 
imation of semigroups corresponding to operator sums, and it has found wide 



application in the numerical ODE/PDE analysis, e.g., [12|. Similarly, the key 
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Figure 1: (a) Lattice decomposition in ^ using the clieckerboard scheme mapped onto a single 
multi-threading processing unit (e.g., GPU). The integer cell coordinates also indicate com- 
munication through boundary buffer regions. In practice other partitionings may result in a 
lower communication overhead, (b) Hierarchical lattice partitioning on a cluster of processing 
units. 



tool for our analysis is a deterministic as well as a stochastic version of the 



Trotter formula, [l8[ , applied to the operator £ = + C'~' 

(11) 



e*^ = lim 



The proposed parallel scheme uses the fact that the action of the operator 
(and similarly of C'^) can be distributed onto independent processing units. 
Thus to reach a time T we define a time step At = — for a fixed value of n and 
alternate the evolution by and C'-' . More precisely, ((TT|) gives rise to the Lie 
splitting approximation for n 3> 1: 



(12) 



Since the simulated systems exhibit short-range interactions, the generators 
, Cf commute for fc, I G X^, k ^ I: 



Cj^Cf - CfCj? ^0, iov al\k,l&X'^,k^l. 



Hence, 36[, we have the exact formula 

Then the expression ([T3| implies that the KMC solvers corresponding to the 
semigroup e^*^ (resp. e^*'~ ) can be simulated exactly by breaking down 
the task into separate processors/threads for each m G (resp. m e I*^). 
Therefore, this scheme allows us to run independently on each fractional time- 
step window Ai, and on every processor, a serial KMC simulation, called a 
kernel. The resulting computational framework consisting of the hierarchical 
decomposition ()10p and (fT2|) permits to input as the algorithm's kernel any 
preferred optimized serial KMC algorithm. 

A single time step of the parallel algorithm is thus easily described in the 
following stages: 

Step 1-Evolution by C^: Simulate independent Markov processes {S'J"}t>o, 
m S by a kinetic Monte Carlo kernel running on non-communicating 
processors that correspond to each Cm for time At. 

Step 2 Local Synchronization: communicate configurations from over- 
lapping domains C*,^ n in order to update configurations a'^ . 

Step 3-Evolution by C'^: Simulate independent Markov processes {S'™}(>o, 
m £ X*^ by a KMC kernel on non-communicating processors that corre- 
spond to each Cm for time At. 

We emphasize that the resulting process {St\t>o is an approximation of the 
process {St}t>Q and we discuss it's features and properties in the next two 
sections. 



10 



3. Processor Communication Schedule and Random Trotter Products 



A key feature of the fractional step methods is the Processor Communication 
Schedule (PCS) that dictates the order with which the hierarchy of operators in 
([6]) are apphcd and for how long. For instance, in (jl2p the processors correspond- 
ing to (resp. L'-') do not communicate, hence the processor communication 
within the algorithm occurs only each time we have to apply e^s^^ or earr-^^. 
Therefore we can define as the PCS the (deterministic) jump process X = X{t), 
t G [0,T], where [0,T] is the simulated time window and taking values in the 
set X = {1,2}, where we assign the value 1 (resp. 2) to O (resp. E): 



2kT {2k + l)T 



Xit) = 2, i^k + l)T^^^i2k + 2)T 



(14) 
(15) 



for all fc = 0, . . . , n — 1. Processor communication occurs at jump times, while 
in the remaining time the processors operate independently and do not com- 
municate. In an analogous way we can define the PCS for the Strang splitting 
scheme 

e^^ w e^^ e^^ e^^ , (16) 
with the scheduling process 

X{t) = l, 
X{t) = 2, 
Xit)^l, 
for all fc = 0, . . . , n — 1. 



2kT {2k+l)T 



2n 

{2k + l)T 
2n 

{2k + 3)T 
2n 



2n 



< t < 

< t < 



{2k + 3)T 
2n 

(2fc-H4)T 
2n 



(17) 
(18) 
(19) 



3.1. Random Fractional Step Methods 

In both cases above (|12p and ([TB| . the communication schedule is fully de- 
terministic, relying on the Trotter Theorem (|lip . On the other hand, we can 
construct stochastic PCS based on the Random Trotter Product Theorem, and 



as we show below the sub-lattice algorithm proposed in j34| is a fractional step 
algorithm with stochastic PCS. 

The Random Trotter Product Theorem, extends PT|) as follows: We 
consider a sequence of semigroups e^'^'S with corresponding operators £5 where 
^ is in the index set X, assuming for simplicity X is finite, although a much 
more general setting is possible, (|25|) . Consider also a stochastic jump process 
X = X{t) with X as its state space. For each of its trajectories we denote by 
£,0,^1, ■■■^71 the (typically random) sequence of states visited by the stochastic 



11 



process X(t) and tq, ti, . . . , t„ the corresponding (also typically random) jump 
times 

X{t) = Co , < t < To , (20) 

X{t) = , To < i < Tl , (21) 

(22) 

X{t) = Cfc , ^fe-i <t<rk. (23) 

We additionally define as N{t) the number of jumps up to time t. We assume 
that X{t) is selected so that it has an ergodic behavior, i.e., there is a probability 
measure /i(d^) such that for all bounded functions g we have that 

lim i /* giX{s)) ds^ f 5(0 A^(dC) . (24) 

t^OO t Jq J 

For example, if X{t) is a Markov process then under suitable conditions, (|24| 
will hold, where will be the stationary distribution of X(t), Conversely, it 
is well-known that for a given /i we can construct in a non-unique way Markov 
processes X(t) which satisfy the condition (P^ . [20j . Now we can state the 
Random Trotter Product Theorem, [T3|, in analogy to (|lll) : 



e"'"''' = lim 

n— >oo 



(25) 



where the operator C is defined on any bounded function as 

Cg^ I C^fiidO . (26) 

It is clear that ([T^ is a special case of (|25p when — r^-i = 1 and ^2fc = 1, 
£,2k+i = 2 for all k. Similarly, we can also view (jl6p as a deterministic analogue 
of dMI). 

On the other hand, in the context of the parallel fractional step algorithms 
for KMC introduced here, the random process ([20| can be interpreted as a 
stochastic PCS. For example, the sub-lattice (SL) parallelization algorithm for 



KMC, introduced in [3J|, is a fractional step algorithm with stochastic PCS: 
indeed, in this method the lattice is divided into sub-lattices, for instance as 
in A]y ~ U A^. Each sub- lattice is selected at random and advanced 
by KMC over a fixed time window At. Then a new random selection is made 
and again the sub-lattice is advanced by At, and so on. The procedure is 
parallelizable as cells C^, within each sub- lattice do not communicate. This 
algorithm is easily recast as a fractional step approximation, when in (|20[) we 
select deterministic jump times Tk and random variables £,k- 

Th^^hzl = M, and P{^k^l) = P{ik=2) = \. (27) 
n A 
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As in here we assign the value 1 (resp. 2) to the O (resp. E) sub-lattice. 
Furthermore, we can easily calculate (|26)) to obtain 



which is just a time rcscaling of the original operator C. Thus the SL algorithm 
is rewritten as the fractional step approximation with the stochastic PCS p7|) 
as 

gT£ ^ ^i^c, . . . e ^..(„t) . (28) 

From the numerical analysis viewpoint, our re- interpretation of the SL algorithm 
in [s^l as a fractional step scheme allows us to also provide a mathematically 
rigorous justification that it is a consistent estimator of the serial KMC algo- 
rithm, due to the Random Trotter Theorem (|25p . That is, as the time step in 
the fractional step scheme converges to zero, it converges to the continuous time 
Markov Chain that has the same master equation and generator as the original 
serial KMC. Finally, the (deterministic) Trotter Theorem pT|) also implies that 
the Lie and the Strang schemes are, in the numerical analysis sense, consistent 
approximations of the serial KMC algorithm. 



4. Controlled Error Approximations of KMC 

In this section we present a formal argument for the error analysis of the 
fractional step approximations for KMC, which suggests the order of conver- 
gence of the schemes, as well as the restrictions on the Fractional Step KMC 
time step At. In the decomposition (|10p the operators are linear operators on 
the high, but finite-dimensional configuration space 5, hence by the standard 



error analysis of splitting schemes, see [12|, we have 



gAt£ _ gAt£-gAt£0 ^ [^B^O _ ^O^B] ^ ^(^^3) ^ (29) 

where we readily see that the term [C^ , C'-'] := CP — L'^ is the Lie bracket 
(commutator) of the operators CP . This Lie bracket captures the effect of 
the boundary regions C^CiC^ through which we have processor communication: 
if there was no communication the Lie bracket would be exactly zero. 

Furthermore, instead of (fT2|) we can consider the Strang-type splitting ([T6|) . 



As in the ODE case, [12|, this is expected to yield a higher order error term 



0{At^) instead of the second order approximation in ([29]), in the following sense: 

[£^,£0] ]}(At)3 + 0{At^) . (30) 

Such calculations suggest that the Strang splitting leads to a more accurate 
scheme, which is balanced by more complicated boundary local communication 
in the same time window At, as is evident when comparing (|12p and (|16p . 
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Next, we briefly comment on the error estimation suggested by the calcula- 
tion p9| and return to the rigorous numerical analysis in [l| . In order to obtain 
an estimate in the right-hand side of (PH]) which is independent of the system 
size N, it is essential to obtain an upper bound on the total number of jumps 
up to the time T. This is a key point related to the extensivity of the system 
and to the fact that the weak error analysis is restricted (as it should be physi- 
cally) to mesoscopic observables satisfying (|44)) . Wc observe the dependence of 
the error on mesoscopic observables in the following subsection. In the context 
of coarse-graining, in [15| an analogous estimate was shown rigorously using a 
Bernstein- type argument applied to the discrete derivatives, in the spirit of 
of the solutions to the backward Kolmogorov equation. We refer to such bounds 
as "Bernstein-like" due to their similarity to gradient estimates for linear and 
nonlinear parabolic PDEs. 



Error Analysis and comparison between random and deterministic PCS 
In this section wc further demonstrate the use of the operator splitting for- 
mulation as a numerical analysis tool by comparing the time-step of At the 



random PCS introduced in [SJ] to the deterministic Lie PCS introduced in 
. A similar comparison can be made for the Strang scheme ([TB| . A detailed 
discussion including rigorous error estimates for mesoscopic observables such as 
(|44|). which are independent of the lattice size N will be discussed in [l[. 

Here we focus on the example of adsorption/desorption discussed in Section 
[21 The generator in the one space dimension is decomposed as in (|10p 



and 



where 



^E(^„\_) cix,a), xGAf o/^ „N _ / c(x,cr), x & K% 



[0, otherwise [0, otherwise 

and the sub-lattices A^,A^ are defined in ([9]). The rates c{x,a) of the cor- 
responding generator ([3]) for the case of Arrhenius adsorption/desorption are 
given by 

c{x, a) = Ca(l - cr(x)) + CdO^x) exp ( - l3U{x, a)) , (31) 

where Ca and Cd are the adsorption and desorption constants respectively, 0- 
The desorption potential U = U{x,a) is defined as 

U{x,a) = Y,J{x-y)a{y), (32) 

where J = J{x — y) is the lateral interaction potential; for simplicity we assume 
that the range of interactions is L, while in typical simplified nearest neighbor 
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models L = 1. Similarly we define diffusion dynamics with Arrhenius dynamics, 



First we discuss the error analysis for the Lie splitting scheme. For given 
finite lattice size N, in the decomposition ([TU| the operators are linear oper- 
ators on the high, but finite-dimensional configuration space 5, hence by the 
standard error analysis of Lie splitting schemes, we obtain (|29p . A more careful 
study of the commutator reveals that the generator decomposition (|10[) induces 
significant cancellations in the evaluation of the generator: indeed, we define 



C°j — Cm \ dC„ 



c° u c£, 



where in Section[5]we introduced dCm — Cm^Cm and Cm = {z E Ajy \ \z — x\ < 
L ,x € Cm} ■ Thus, in (jlOp we obtain the further decomposition 



c 



E,d 



c. 



E,d 



(33) 



where C^^'°,C^'^ is the restriction of on and respectively. Analo- 
gously we define jC'~' = C'-^'° + £'~''^. We now return to the evaluation of the 
commutator 



-E.d qO,o 



However, due to the lack of communication between generators beyond the 
interaction range, we have that 



0. 



thus we readily get 



0, 



? — m I — 1 



E,d rO,d^ 



(35) 



The formula p5p captures the processor communication between boundary re- 
gions of Cm^ C^ . But more importantly, when combined with (\29^ . it suggests 
the limitations on the time window At of the Lie scheme p2|) , denoted for dif- 
ferentiation by Atj^jg, in order to obtain a given error tolerance TOL. In that 
sense it is useful to obtain an upper bound on psp . Indeed, we readily obtain: 



|i-m| = l 



E 



c^iy,a) c''{x,a)-c^{x,ay) f{ay) 



(36) 
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where all summations are over x S C^'^,?/ € Cf''^. For mesoscopic observables, 
such as the mean coverage f{a) = X^xga"'!^) obtain 



li-ml = l 



1 - 2cr(a;) 



TV 

1 - My) 

N 

(37) 



where all summations are over x G C^'^,?/ e Cf^' . Therefore, due to the 
cancellation of all interior components C^ °,C^'° in (|35[) . we obtain the bound 
for the case of the interaction range L = 1, 

|[/:^,/:«]/(.)|^o(^)=o(i), (38) 

where q is the size of each cell Cm, and 0(1) depends on the physical parameters 
in the rate (|3T|) . The local error analysis in (|29|) . ((38)) can be propagated up to a 
prescribed time T = A^]^jj,Atj^jg Therefore, for the simulation of the mesoscopic 
observable / up to the time T within a given error tolerance TOL, and 
psp give the o&serua&Ze-dependent relation for the Lie time step 

TOL ~ T . |[/:^, £0]/(a)|AtLie ^ T ■ o(l) Atj^ig (39) 

Next, using the fractional step formulation, we analyze in the same spirit 
as for the Lie scheme, the random PCS ((27| proposed in For notational 

simplicity we set Ai = C'-' , A2 = jC^ . Then the local error operator E'^^ can 
also be calculated as in (l29l): 



Local Error = S^* .^^^tA,, ^AtA,, _ ^At(A,+A,) 

I + [A^, + A^,)^t + ]^{Al + 2Ai^,A^, + Al)M'' 

/ + (Ai + ^2)At + i(Ai + A^fM"-^ + 0{At^) (40) 

The mean value of the error over the sequence of independent random variables 
^ = (Ci J* = 1: ■•■) '^) of the PCS ((27| on an observable / = /(cr) , s g 5 can be 
explicitly evaluated: 

E^[E^'f] = i(Ai - A2ffAf + 0(At3) ^ 1(/:^ - C^ffAf + 0(At3) . 

As in (pS]) . for the mesoscopic observable /(cr) = X^a-eA '''(2^)1 '^'^ obtain, after 
disregarding the higher order local error 0(Ai'^), 

(£^-£0)2/(^)^0(1), (41) 
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where 0(1) depends on the physical parameters in the rate ([5T|). Similarly to 
(|39|) , for the simulation of the mesoscopic observable / up to the same prescribed 
time T — ^j^andom^^Random' ''^ithin the same error tolerance TOL, ([2^ and 
(|4ip give the o&seruafele-dcpcndcnt relation for the random PCS time step 



TOL ^ T . \{C^ - /:°)V(^)|AtRa,,dom ^ T ■ 0(l)AtR^^dom (42) 



Comparing the random and the Lie PCS through (|39p and (|^^ implies that in 
order the two schemes to conform (in the mean) to the same tolerance TOL, 
their respective time steps should be selected so that 



This relation in turn suggests that the Lie scheme is expected to parallelize 
better than the random PCS ((27)) since it allows a g-times larger time step At 
for the same accuracy, keeping in mind that during each time step processors 
do not communicate. 

A similar analysis is possible for general mesoscopic observables / ~ /(c) , s € 
S, e.g., spatial correlations, that satisfy 



where C is a constant independent of N, see the formulation and estimates for 
coarse-grained stochastic systems in [l5|. We revisit this issue, as well as the rig- 
orous derivation of A^- independent error bounds in place of the expansions (|29p . 
pop in the upcoming publication [ij. Such estimates can also allow a detailed 
analysis on the balance between accuracy and local processor communication 
for PCS such as dni), (HSl) and (EZl). 

5. Hierarchical structure of Fractional Step algorithms and imple- 
mentation on GPUs 

The fractional step framework allows a hierarchical structure to be easily 
formulated and implemented, which is a key advantage for simulating in par- 
allel architectures with complex memory hierarchies and processing units. The 
Graphical Processing Unit (GPU) architecture is inherently different from a tra- 
ditional CPU architecture. GPUs are massively parallel multi-threaded devices 
capable of executing a large number of active threads concurrently. A GPU 
consists of multiple streaming multiprocessors (MP), each of which contains 
multiple scalar processor cores. For example, NVIDIA's C2050 GPU architec- 
ture contains 14 such multiprocessors, each of which contains 32 cores, for a 
total of 448 cores which can handle up to 24k active threads in parallel. A GPU 
has several types of memory which are differently organized compared to the 
traditional hierarchical CPU memory, most notably the main device memory 
(global memory) shared between all the multiprocessors and the on-chip mem- 
ory shared between all cores of a single multiprocessor (shared memory). The 



A^Lie ^ Cl('?)A^Random 



(43) 




(44) 
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memory sizes and access speeds depend on the type of GPU. For instance, the 
memory size of the NVIDIA C2050 GPU is 3GB while the memory size of the 
NVIDIA C2070 GPU is 6GB. 

From the perspective of a GPU programmer writing a code for NVIDIA 
GPU's, the GPU is treated as a co-processor to the main CPU. Programs are 
written in C and hnked to the CUD A hbraries. A function that executes on 
the GPU, called a GPU kernel, consists of multiple threads executing code in 
a single instruction, multiple data (SIMD) fashion. That is, each thread in a 
GPU kernel executes the same code, but on different data. Further, threads 
can be grouped into thread blocks. This abstraction takes advantage of the fact 
that threads executing on the same multiprocessor can share data via on-chip 
shared memory, allowing some degree of cooperation between threads in the 



same block j32| . A major drawback in GPU programming is the slow com- 



munication between GPU global memory and the main memory of the CPU, 
compared to the communication within a GPU. Programmers address this prob- 
lem by maximizing the amount of arithmetic intensive computations performed 
on GPU, minimizing the communication between CPU and GPU, and allowing 
the communication latency to be hidden by overlapping with execution. Com- 
munication among CPUs, although costly, is enabled by APIs such as OpenMP 
and features available in CUDA 2.2-1- such as portable pinned memory, when 
the communication is among GPUs connected to the same shared-memory com- 
puter node. When the communication takes place among GPUs across nodes of 
a cluster, message passing paradigms such as MPI can serve the same scope. 

In our parallelization of the KMC method, we redefine the data structures 
to represent lattice sites in the simulation so that the whole simulated system 
is cut into equal-sized black and white coarse cells like a chessboard in ([9|) . For 
instance. Fig. 1(a) shows a simple example in which we map a 4 x 4 lattice sites 
into 2x2 cells, each cell containing 2x2 sites. One GPU thread is assigned to 
one cell. Coverage information of the whole lattice is stored in an array located 
in the GPU global memory so that all the threads can access the information 
related to their neighboring sites across MPs. The GPU kernel performing the 
KMC simulation over the whole lattice by using the Lie scheme ([T^ and the 
decomposition ([TU]) . is sequentially launched twice for each synchronization time 
step At to work on the black and white cells respectively. The execution times 
for lattices of different sizes are compared in Fig[2j where we take as a reference 
a sequential KMC-kernel, which is a direct numerical implementation of ^ and 
([2]). The same kernel is then used for the implementation on GPUs where we 
compare times for different choices of At. Wc remark that the KMC kernel is 
not optimized by techniques such as the BKL algorithm, [l^ , which is also 
manifested in the scaling with respect to the size of the lattice TV. However, 
the same kernel is used in the Fractional Step algorithm thus here we present 
comparisons between the same KMC algorithms, one serial and one parallelized 
by the Fractional Step approach. Clearly any optimized KMC kernel can be 
used without difficulty in our framework. 

The size of lattices that can be simulated on a single GPU is limited by 
memory, thus in order to simulate large systems it will be necessary to employ a 
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GPU and Sequential code execution time 
10 % ^ ^ ■ ^ ^ 




lattice size 

Figure 2: Execution time of the fractional step KMC for lattices of different sizes. The 
comparison with the sequential algorithm (top curve) is based on the same SSA KMC imple- 
mentation which, however, does not have the optimal complexity of the BKL algorithm. The 
simpler implementation of the SSA algorithm was used. The simple implementation has the 
complexity 0{N^), where TV is the total number of lattice sites. This complexity is reflected 
in the indicated scaling (the slope in the log- log plot). Note that the due to partitioning of the 
lattice in the fractional step algorithm the same KMC kernel will scale as 0{N) only, which 
is in agreement with the observed slope in the plots. 
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cluster of GPUs communicating, for instance, through an MPI protocol We wih 
demonstrate next how Fractional Step KMC algorithms can be tailored to an 
architecture that involves multiple GPUs. We return to the formulation in PU)) . 
and consider the sub- lattice decomposition In this formulation each one of 
the coarse-cells or are simulated on a single GPU. Within each one of 



the GPUs we have the same lattice decomposition as in ©, see Figure 1(b) 
namely 

= U O U D^,f U U D^O , (45) 
1=1 1=1 

and similarly we define a decomposition for C^. Each one of the (sub-)sub- 
latticcs D^f and D!^f corresponds to individual threads within the GPU. Next, 
([9|) and (|45p define nested sub-lattices, which yield a hierarchical decomposition 
of the operator C into ([T0|) and 



1=1 1=1 

and similarly we also define the decomposition for C'^^. Finally, schemes such as 
p2|) and pH)) give rise to Fractional Step algorithms based on the nested decom- 



positions (fTO|) and (|46|) . In this case, boundary communication, see Fig. 1(b) 
plays a key role in the parallelization of our algorithm when multiple GPUs are 
required. As we discussed earlier, this scenario happens when the lattice size 
grows to the point that the lattice data structures no longer fit into a single GPU 
global memory. In turn, this threshold depends on the type of GPU used, e.g., 
for a NVIDIA's C2050 GPU the maximum lattice size is currently 8, 182 x 8, 182 
cells. To simulate larger systems, we can decompose the domain into regular 
sub-domains and distribute both the sub-domain cells and associated compu- 
tation among multiple GPUs, as discussed in (j46p . Boundary communication 



between two adjacent sub-domains are exchanged between GPUs, sec Fig. 1(b) 
and supported by either MPI or OpenMP, depending on the fact that the GPUs 
are located on the same cluster node or across nodes. Thus, the multi-GPU par- 
allel KMC algorithm is based on and benefits from the hierarchical structure of 
the Fractional Step KMC algorithms discussed in (|46|) . At the same time, it can 
enable the scalability of our simulations to lattice sizes beyond the ones acces- 
sible with a single GPU e.g., 8, 182 x 8, 182 sites in a C2050 GPU. The study of 
performance and scalability of our multi-GPU algorithm and code for different 
lattice sizes and types of GPU clusters is beyond the scope of this paper. 



6. Mass Transport and Dynamic Workload Balancing 

Due to the spatially distributed nature of KMC simulations and the depen- 
dence of jump rates on local coverage, ((1}, fractional step algorithms may have 
an imbalance in the number of operations /jumps performed in each coarse cell 
Cm in ([9]), as well as on the corresponding processors. In fact, formulas ([T]) and 
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Figure 3: (a) Workload imbalance in ID unimolecular reaction system: the top figure depicts 
local coverage, the bottom figure workload distribution; (b) Workload redistribution in Figure 
(a) using the mass transport for re-balancing. 



and the very structure of the fractional step algorithms pH)) . allow us to 
define the workload WnAt{<^) = W^TiAt(™; cr), 1 < m < M as 

WnAtim) = # jumps in Cm during [{n - l)At,nAt] , (47) 

when the configuration at time (n — l)At is a. We also renormalize WnAt 
(and still denote it with the same symbol) in order to obtain a histogram, 
i.e., a probability density. Since different coarse cells Cm in the fractional step 
algorithms such as ([T^ or (|16p do not communicate during intervals of length 
At the quantities (|47p are easy to keep track on-the-fly during the simulations. 
The possibility of workload imbalance is depicted in Figure [Sj where many more 
jumps are performed in the processors corresponding to cells of low coverage, 
while the other processors remain idle. 

In this Section we introduce a probabilistic strategy to re-balance the work- 
load WnAt dynamically during the simulation based on the following idea from 
Mass Transport methods, e.g., [loj . One wants to transport the "imbalanced" 
density WnAt into an almost uniform density over the number of processors 
used, in order to ensure that they remain as uniformly active as possible. The 
mass transport connection and terminology refers to the mapping of a given 
probability measure into a desirable probability measure. Typically, [lo| . this 
problem is posed as an optimization over a suitable cost functional and is known 
as the Monge-Kantorovich problem. In our context the cost functional could 
reflect constraints related to various parallel architectures. 

We can formulate and implement this strategy in several different ways: 
probably the simplest approach, that serves mostly as an illustration, is to 
assume that we have a number of processors P, where P <^ M; during the 
interval [{n — l)At,nAt] a number of coarse cells Cm, I < m < M, which 
are simulated independently in a fractional step algorithm, arc allocated to 
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each processor. By the end of the simulation time nAi the workload on all 
processors is described similarly to (|47)) . by a histogram i?„At(f) = RnAt{l\cr), 
1 < I < P. One wants to map (|T7)) onto a histogram RnAt which is almost 
uniform in 1 < / < P. One such function can be constructed by mapping the 
mass corresponding to each value of the cumulative distribution function (cdf) 
of (|T7)) . onto an equal mass on the uniform distribution over the P processors. 
In another implementation of the mass transport method we can adjust the 
size of the coarse cells Cm according to the workload redistribution strategy 
discussed earlier, see Figure [31 This is effectively a one-dimensional example of 
an adsorption/desorption process where the mass transport procedure is carried 
out by mapping ((17)) into a new histogram i?„At(o') = RnAtil'^o') corresponding 
to a new set of variable size coarse cells Ci, 1 < I < M'. The cell size adjustment 
ensures the uniformity of the new histogram by defining RnAt as a mapping of 
the cdf corresponding to (|T7)) . 

The mass transport mappings discussed above are not expected to be carried 
out at every time step nAt in order to reduce computational and communication 
cost, but instead they should follow a rationally designed coarser- in-time sched- 
ule, in analogy to processor communication scheduling, e.g., ((20)) . The overall 
implementation appears rather simple since here we demonstrated the method- 
ology in a one-dimensional example. However, in higher dimensions, adjusting 
the size and shape of coarse cells Cm can be much harder. Nevertheless the 
structure of re-balancing procedure can remain one-dimensional even in higher 
dimensional lattices if we pick a sub-lattice decomposition ((5)) into strips Cm- 
We note that the mapping we constructed using cdf 's did not take into account 
the processor architecture and a suitable cost functional formulation for the 
mass transport to a uniform distribution, as in the Monge-Kantorovich prob- 
lem, [l^l, may be more appropriate. We will revisit such issues in a future 
publication. 

7. Parallel Simulations: Benchmarks and Applications 

Exactly solvable models of statistical mechanics provide a test bed for sam- 
pling algorithms applied to interacting particle systems. We present benchmarks 
for two important cases: (a) sampling of equilibrium distributions, i.e., long time 
behavior of the simulated Markov process, and (b) weak approximations of the 
dynamics. In the first set of tests we work with the classical Ising model on one 
and two dimensional lattices where spins interact through a nearest-neighbor 
potential. Thus the Hamiltonian of the system is 

x^An \y — x\ — l xGAn 

where _ftr is a real parameter that defines the strength of the interaction and 
h the external field. We work with the spin-flip Arrhenius dynamics with the 
rates defined in the nearest-neighbor set = {z\\z — x\ = 1} and the updates 
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(a) Coverage 



(b) Two-point spatial correlations 



Figure 4: (a) Comparison of the exact solution II51I I (solid line) for the total coverage cp{K, h), 
K = 1, with the mean coverage obtained in simulations on the one-dimensional lattice with 
A'^ = 2^^ and At = 1.0. (b) Two-point spatial correlation function estimated at h = 1 on the 
same lattice and At = 1.0 compared to the exact solution. 



in = {0, 1}. 

c(x, a) = ci(l - a(.T)) + C2cr(a;)e-'^^(=^) , (48) 
U{x) ^ K ^ a{x + y) + h , (49) 

with /3 is a given inverse temperature. The generator of (|48|) is a self-adjoint op- 
erator on the space L^{S, /ijv) where ^N{da) ~ Z^^e^^^^'^"^ da is the canonical 
Gibbs measure of the system at the constant inverse temperature /3. Conse- 
quently the dynamics is reversible and the measure /it of the process {St}t>o 
converges to the Gibbs measure iin as t — > oo. Thus the dynamics (|48| can be 
used for computing expected values [/] by invoking ergodicity and averaging 
on a single trajectory 

E^J/]^ / /(a)Miv(dfT)- lim ^ f f{St)dt. 
Js ^ Jo 

In the simulations we estimate two observables: 

mean coverage: Ct = -r-. — rE[ <^t[x)\ , 

2-point correlation function: At (a;, y) = E[at{x)at{x + y)] . 

Due to translational invariance the function Xk{x,y) depends on the distance 
|x — ?/| only. For exactly solvable one and two dimensional Ising models we have 
explicit formulas which we summarize here for the spins in S = {0, 1}. 

ID Ising model: The one-dimensional Ising model does not exhibit a phase 
transition and thus presents a simple benchmark for accuracy. Working with 
lattice gas models requires a simple transformation of the well-known exact 
solution, 0, which for the Hamiltonian of the system given on the periodic 



23 



(a) Coverage 



(b) Two-point spatial correlations 



Figure 5: (a) Comparison of the exact solution (1 541 1 (solid line) for the total coverage cpi^K, h), 
h = 2, with mean coverage obtained in simulations on the one-dimensional lattice with N = 
128 and various At's. (b) Spatial two-point correlation function in the two-dimensional Ising 
model simulated on the lattice N = 512^ at a sub-critical temperature f) > fic. and supercritical 
regime f) < fic- The simulation confirms the behavior obtained from the infinite volume exact 
solution: at high temperatures the decay is exponential while at temperatures below the 
critical temperature the decay is algebraic. The dashed line represents the fitted function of 
the form k~'^e~^l^ . 



N 



N 



lattice 

H{(t) = -K ^ (7{x)a{x + l) + h^ a{x) , 

x—1 x—1 

yields the equilibrium mean coverage and the 2-point correlation function 




sinh(/i') 



sinh^h') + e-^^') 
l + e4^'sinh2(/i')) X 



1 /2 ' 

e^'cosh(/i') - e-^" (l + e4^'sinh2(/i')) 



ish(/i') + e-K' (1 + e4^'sinh^(/i')) 



1/2 



where 



K' = ^f3K , and h' = ^/3{h - K) 



(50) 
(51) 

y>x, (52) 
(53) 



Since the one-dimensional Ising model does not exhibit a phase transition it 
allows us to assess the accuracy of the approximation for the phase diagram 
calculation. The phase diagram depicting dependence of the coverage on the 



external field for different values of /3 is shown in Figure 4(a) In this simulation 
a rather conservative At =1.0 was chosen. The statistical errors (confidence in- 
tervals) are below the resolution of the graph. As seen in the figure the isotherms 




Figure 6: (a) Estimated equilibrium distributions of the coverage process at the two tem- 
peratures simulated in Fig. |5(b)[ (b) Autocorrelation functions for the coverage process in 
the two-dimensional Ising model simulated at /3 = 1.5 (high temperature above the critical 
temperature /3c and at /3 = 1.78 > /3c (low temperature), see parameters in Fig. |5(b)] 



for the average equilibrium coverage are thus obtained with a good accuracy. 
As a global observable the total coverage is less sensitive to statistical errors 
therefore we also monitor the 2-point correlation function and its agreement 
with the exact solution ([5^ . The results for different values of /3 in Figure [4(b) | 
demonstrate good accuracy. 

2D Ising model: The phase transition that occurs in two-dimensional Ising model 
presents a more challenging test case. However, the celebrated exact solution 
due to Onsager for spins E — { — 1, 1}, in the case with the zero external 
field and further refinements yield closed formulas for the mean coverage and 
two point correlation functions. We restrict our tests to the isotropic case, i.e., 
on the two-dimensional periodic lattice we have the Hamiltonian 

H{a)=-K ^ (cr(a;i,a;2)cr(xi,a;2 + 1) + 
a{xi,X2)<7{xi + 1,X2)) + h ^ a{x) . 

Transforming the exact solutions for the spins S = {0, 1} we obtain the equiv- 
alent to the zero external field the value h = 2K at which value the critical 
inverse temperature solves sinh(i/3cif) = 1- The exact solution for the mean 
coverage has the form 

„„.^n-' ['-'^-'^-'l'")- (54) 

P <l3c- 

The exact solution for the 2-point correlation is available in (STj . however, we 
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Figure 7: (a) A sample path of the total coverage process {St} simulated at At = 1.0 and 
At = 0.1 on the one-dimensional lattice with A'^ = 2^^ and At = 1.0. (b) Autocorrelation 
function of the coverage process. The means were obtained from A4 = 1000 independent 
realizations of the process at /3 = 4 and h = I. The inset shows error bars for the empirical 
mean estimator. 



use only the asymptotics in \x — y\, [sj. Introducing k ~ {smh(^/3K)) ^ we have 



^^""'^^-^ 0(^-1-^1/2), /3<- ^^^^ 



The phase diagram is computed ai h = 2 which for if = 1 corresponds to 
the regime when the second-order phase transition occurs at the critical temper- 
ature smhiJ^Kfic) = 1. Sampling the coverage exhibits well-known difficulties 
close to the critical point /3c which are not cured by the fractional step algo- 



rithm. Instead, we demonstrate in Figure 5(a) that for wide range of choices 
the phase diagram is constructed accurately for /3 outside a neighborhood 
of /3c- Close to the critical point the algorithm provides approximations that 
are in agreement with other Monte Carlo sampling approach. The finite-size 
effects are pronounced at the neighborhood of the critical point due to algebraic 
decay of correlations. Thus it is not expected that a good agreement with the 
infinite volume exact solution will be observed in the finite size simulations. 
Nonetheless, the presence of the second-order phase transition is indicated in 
the computed phase diagram. Furthermore, the proposed algorithm provides 
an efficient implementation that allows for simulations on large lattice. It is 
shown in Figure |5(b)| that algebraic decay of the 2-point correlation function is 
well approximated in the low-temperature (sub-critical) regime, while at super- 
critical temperatures the exponential decay is observed. Overall, we note that 
such long-time sampling of the simulated CTMC is a particularly challenging 
task since in principle, errors from any approximation may accumulate at long 
times and contaminate the simulation. 
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Table 1: An Event in fix, a;"" € fix is a randomly selected site from the nearest- neighbor set 
of X, and r2[x) = — a{x)'^)ug, r3{x) = ■^cr(x){l + o-(x))u^j^, r4,{x) = ■ga{x)[a{x) — , 
where is the number of nearest neighbors (n.n.) of x that are equal to k. 





site 


a{x) 




Rate c(a;, oj; cr) 


Comment 


1 


vacant 





^ 1 


fci(i"M^))') 


CO adsorb 


2 


vacant 





0^-1 


(l-fci)r2(x) 


O2 adsorb 








0^-1, x"" 






3 


CO 


1 


1 -> 




CO + and desorb 








-1 ^ 0, x"" 






4 





-1 


-1^0 


k2rA{x) 


CO + and desorb 








1 ^ 0, x"" 







Studying approximation properties of the stochastic dynamics poses a more 
difficult task due to the lack of an exact solution for the evolution of ob- 
servables. Certain guidance can be obtained from mean-field approximations, 
however, those do not give sufficiently good approximation for Ising model in 
low dimensions. Therefore we compare the evolution of the coverage obtained 
from the traditional SSA algorithm with approximations generated by the pro- 



posed fractional time step algorithm with different choices At. In Figure 7(a) 
we compare the expected value and variance of the total coverage process 
^* ~ I Aw I Sajv '^*(^)- Furthermore, it is also shown that the auto-correlation 
function for the process Ct is well-approximated and approximations converge 
as At ^ 0, see Figures [7(E)] and |6(b) [ 

7.1. Examples from Catalysis and Reaction Engineering 

In order to demonstrate the applicability of the proposed parallelization 
methodology in systems exhibiting complex spatio-temporal morphologies at 
mesoscopic length scales, e.g., islands, spirals, rings, etc., we implement a KMC 
algorithm arising in the modeling of chemical reaction dynamics on a catalytic 
surface. Here we focus on CO oxidation, which is a prototypical example for 
molecular-level reaction-diffusion mechanisms between adsorbates on a surface. 
We note that molecular dynamics simulations have also been ernployed to un- 
derstand micro-mechanisms on surfaces such as reaction paths 130|. However, 
reaction kinetics for mesoscale adsorbate structures cannot be simulated by us- 
ing molecular dynamics because of spatio-temporal scale limitations of such 
methods, while KMC methods, have the ability to simulate much larger scales 

In KMC models for CO oxidation on a catalytic surface spatial resolution 
is a critical ingredient of the modeling since in-homogcneously adsorbed O and 
CO react on the catalytic surface only where the corresponding phases meet. 
Sophisticated KMC models for CO oxidation on catalytic surfaces, where kinetic 
parameters are estimated by ab initio density functional theory (DFT), 
were recently developed in [31[ and later in 27[, 2l|. Such KMC models yield 
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Figure 8: Snapshot at different simulation times for the CO oxidation process, on a two- 
dimensional lattice N = 1024^ . 

a remarkable agreement with experiments, see also the review articles (26| and 
i- 

Next we demonstrate the performance of parallel Fractional Step algorithms 
for KMC simulation to heterogeneous catalysis. We implement a siniplified CO 
oxidation model known as the ZifF-Gulari-Barshad (ZGB) model, [39j, which 
was one of the first attempts towards a spatially distributed KMC modeling in 
reaction systems. Although a simplified model compared to the ab initio KMC 
models described earlier, it incorporates the basic mechanisms for the dynamics 
of adsorbatc structures during CO oxidation on catalytic surfaces: single site 
updates (adsorption/dcsorption) and multi-site updates (specifically, reactions 
with two sites being involved). The spins take values cr(a;) = denoting a vacant 
site X G Apf, a{x) = —1 for a molecule CO at x, and cr(x) = 1 representing a 
O2 molecule. Depending on the local configurations of the nearest neighbors in 
Tlx = {y \ \y — x\ = 1} the events in Table[l]are executed. The rates of individual 
events depend on the states in fl^ which are enumerated by w = {1, 2, 3, 4} and 
are summarized in Table [TJ 

The execution times for lattices of different sizes are compared in Figure |2J 
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Figure 9: Evolution of the mean coverage for species in the oxidation process (CO, O2, and 
vacant sites). 



while a snapshot of the spatial morphology is depicted in Figure HI Here we take 
as a reference the sequential KMC-BKL kernel. The same kernel is then used 
for the implementation on GPUs where we compare times for different choices of 
At. We remark that the KMC kernel is not optimized by techniques such as the 
BKL algorithm, [l^ , which is manifested in the scaling with respect to the size of 
the lattice N. However, the same kernel is used in the fractional step algorithm 
thus we present fair comparisons between serial and parallel solvers, noting that 
any optimized serial KMC algorithm can be used as a kernel in our Fractional 
Step framework. It is worth noting that by partitioning of the problem into the 
subproblems the 0{N'^) complexity of the simple implementation for the SSA 
algorithm is reduced, which is also demonstrated in Figure [5] where the slope of 
lines for simulations using GPUs suggest the reduced complexity of order 0{N). 
Hence the proposed approach also offers a simple but efficient implementation 
of KMC simulators. 

Finally, in our implementation (as well as in the original ZGB model) we 
did not implement the fast diffusion mechanism of O adsorbatcs on the surface. 



2l| . However, the scheme (jl6p can allow us to easily implement within our 



parallelization framework schemes with disparate time-scales which turn out to 
be important for the long-time adsorbate dynamics. 
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8. Conclusions 



In this paper wc proposed a new framework for constructing parallel algo- 
rithms for lattice KMC simulations. Our approach relies on a spatial decompo- 
sition of the Markov generator underlying the KMC algorithm, into a hierarchy 
of operators corresponding to processors' structure in the parallel architecture. 
Based on this operator decomposition, we can formulate Fractional Step Ap- 
proximation schemes by employing the Trotter product formula; these schemes 
allow us to run independently on each processor a serial KMC simulation on 
each fractional time-step window. Furthermore, the schemes incorporate the 
Communication Schedule between processors through the sequential applica- 
tion of the operators in the decomposition, as well as the time step employed in 
the particular fractional step scheme. Here we discussed deterministic schedules 
resulting from Lie- and Strang-type fractional step schemes, as well as random 



schedules derived by the Random Trotter Theorem, [18| . We demonstrated that 
the latter category includes the algorithm |34| as one particular example. 

Some of the key features of the proposed framework and possible future 
directions include: The hierarchical structure can be easily derived and imple- 
mented for very general physiochemical processes modeled by lattice systems, 
allowing users to input as the KMC kernel their preferred serial algorithm. This 
flexibility and hierarchical structure allow for tailoring our framework to partic- 
ular parallel architectures with complex memory and processor hierarchies, e.g., 
clusters of CPUs communicating, for instance, through an MPI protocol, and 
using the nested generator decomposition pS)) . Moreover, multi-scale Trotter 
algorithms for systems with fast and slow processes arc widely used in Molecular 
Dynamics, e.g., [l^ . and they can be recast along with the proposed methods 
into a spatio-temporal hierarchy of operators that allow computational tasks to 
be hierarchically decomposed in space/time. The numerical consistency of the 
proposed algorithms is rigorously justified by Trotter Theorems, [H, m show- 
ing the convergence of our approximating schemes to the original serial KMC 
algorithm. Related numerical estimates are expected to provide insights on 
the design and the relative advantages of various communication schedules and 
architectures. We discussed work load balancing between processors through 
a re-balancing scheme based on probabilistic mass transport methods that is 
particularly well-suited for the proposed fractional step KMC methods. We 
carried out detailed benchmarking using analytically available exact solutions 
from statistical mechanics and applied the method to simulate complex spatially 
distributed molecular systems, such as reaction-diffusion processes on catalytic 
surfaces. Finally, we studied the performance and scalability of our algorithm 
([46|) and the resulting code for different lattice sizes and types of CPUs. 

Concluding we note that there are some interesting conceptual analogies 
between the parallelization and coarse-graining algorithms of KMC such as the 
Coarse-Grained Monte Carlo (CGMC) method e.g., [l^jQ- both methods we 
decompose the particle system in components communicating minimally, e.g., 
(fTO|) . (fT2|) . or trivially as in coarse-graining methods, thus, local information is 



represented by collective (coarse) variables, or computed on separate processors 
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within a parallel architecture. An early work towards parallelizing CGMC [14 1 
in problems with locally well-mixed particle interactions is [s^, while further 
progress towards understanding and exploiting the analogies and the comple- 
mentarity of CGMC and parallel KMC has the potential to give efficient KMC 
algorithms capable of simulating complex systems at mesoscopic length scales. 
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